The goal of this notebook is to run the entire modeling process for the adult dataset, made available for the challenge in Curso-R's Introduction to Machine Learning course, using the tidymodels framework — that is, to explore, clean, prepare, tune, and select the model that best fits the data.
adult <- read_rds("adult.rds")
# head(adult)
# glimpse(adult)
skim(adult)
-- Data Summary ------------------------
Values
Name adult
Number of rows 32561
Number of columns 16
_______________________
Column type frequency:
character 9
numeric 7
________________________
Group variables None
-- Variable type: character ----------------------------------------------------------------------------------------
# A tibble: 9 x 8
skim_variable n_missing complete_rate min max empty n_unique whitespace
* <chr> <int> <dbl> <int> <int> <int> <int> <int>
1 workclass 1836 0.944 7 16 0 8 0
2 education 0 1 3 12 0 16 0
3 marital_status 0 1 7 21 0 7 0
4 occupation 1843 0.943 5 17 0 14 0
5 relationship 0 1 4 14 0 6 0
6 race 0 1 5 18 0 5 0
7 sex 0 1 4 6 0 2 0
8 native_country 583 0.982 4 26 0 41 0
9 resposta 0 1 4 5 0 2 0
-- Variable type: numeric ------------------------------------------------------------------------------------------
# A tibble: 7 x 11
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
* <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 age 0 1 38.6 13.6 17 28 37 48 90 ▇▇▅▂▁
2 fnlwgt 0 1 189778. 105550. 12285 117827 178356 237051 1484705 ▇▁▁▁▁
3 education_num 0 1 10.1 2.57 1 9 10 12 16 ▁▁▇▃▁
4 capital_gain 0 1 1078. 7385. 0 0 0 0 99999 ▇▁▁▁▁
5 capital_loss 0 1 87.3 403. 0 0 0 0 4356 ▇▁▁▁▁
6 hours_per_week 0 1 40.4 12.3 1 40 40 45 99 ▁▇▃▁▁
7 id 0 1 16281 9400. 1 8141 16281 24421 32561 ▇▇▇▇▇
The variables appear to have the correct types. One point of attention: workclass, occupation and native_country contain missing values.
Now let's examine each variable's behavior to decide how to prepare the data for the model.
# DataExplorer::create_report(adult)
devtools::source_url("https://raw.githubusercontent.com/ricardomattos05/functions/master/function_AED_bivariada.R")
#
#
adult2 <- adult %>%
select(-id) %>%
mutate(resposta = if_else(resposta == ">50K", 1, 0))
#
#
# names(adult2)
# Bivariate EDA: plot every predictor against the response (column 15 of adult2)
for (i in 1:(length(adult2) - 1)) {
  df <- adult2[, c(i, 15)]
  cat("### ", names(df[, 1]), "\n")
  print(AED_biv(df, "resposta", "Pre"))
  cat("\n\n")
}
Observations:
education : the higher the education level, the higher the proportion of people earning above 50K. The categories below HS-grad (1st-4th through 12th), besides being under-represented, have a low >50K proportion, so we'll consolidate them into a new category, HS-not-grad.
marital_status : here we'll merge Married-AF-spouse and Married-civ-spouse into a new Married category, based on their similarity with respect to the response variable and on their descriptions.
native_country : a field with little variability — about 90% of the rows are "United-States". We could keep only United-States and lump everything else as "other", but to retain as much information as possible we'll reduce it to 3 categories: United-States (the dominant level), countries whose >50K proportion is above the overall mean, and countries at or below it.
relationship : this field contains both husband and wife; it looks like we could group them — let's dig deeper.
capital_loss and capital_gain : apparently both those who gain and those who lose some amount are more likely to earn >50K. Let's examine how the two variables relate.
workclass : under-represented categories such as Never-worked and Without-pay have no observations with the response of interest (">50K"); let's zoom into this variable and also inspect the NAs we identified.
ggplot(adult, aes(x = occupation, fill = resposta)) +
geom_bar(position="fill") +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("occupation")
Clearly a modal imputation of the NAs would not make sense: our goal is maximum predictive power, so we don't want to lose information. Instead of diluting the NAs into the most frequent category, Prof-specialty, we'll assign them to a well-represented category with a similar >50K proportion, Farming-fishing.
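A minimal sketch of that reassignment, assuming dplyr (in the actual pipeline this likely lives inside the recipe's step_mutate()):

```r
library(dplyr)

# Hypothetical illustration: send occupation NAs to Farming-fishing,
# a well-represented category with a similar >50K proportion.
adult <- adult %>%
  mutate(occupation = if_else(is.na(occupation), "Farming-fishing", occupation))
```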
ggplot(adult, aes(x = relationship)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("relationship")
ggplot(adult, aes(x = relationship, fill = resposta)) +
geom_bar(position="fill") +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("relationship")
ggplot(adult, aes(x = relationship, fill = sex)) +
geom_bar(position="fill") +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("relationship")
So let's balance out gender by merging the Wife and Husband categories into a new Married category.
ggplot(adult, aes(x= capital_gain, y= capital_loss)) +
geom_point()
sum(adult$capital_loss > 0 & adult$capital_gain > 0)
[1] 0
Since the two never co-occur, we can sum them into a single capital_total variable without fear of losing information.
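A sketch of that combination (following the text we simply add the two columns; one could also use capital_gain - capital_loss to keep the sign — the choice here is an assumption):

```r
# The check above returned 0, so gains and losses never co-occur:
# one column carries all the information of both.
adult <- adult %>%
  mutate(capital_total = capital_gain + capital_loss)
```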
ggplot(adult, aes(x = workclass)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("Workclass")
ggplot(adult, aes(x = workclass, fill = resposta)) +
geom_bar(position="fill") +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("Workclass")
Apparently the NA category relates to the response differently from every other category, so we'll create a new level, not-identify, to hold the NA values.
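As with occupation, a minimal sketch of the reassignment (the actual step is presumably inside the recipe's step_mutate()):

```r
# Hypothetical illustration: NAs in workclass become their own level.
adult <- adult %>%
  mutate(workclass = if_else(is.na(workclass), "not-identify", workclass))
```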
med <- (adult %>%
select(resposta) %>%
filter(resposta == ">50K") %>%
count() %>%
as.numeric())/nrow(adult)
tb_country<- adult %>%
select(native_country, resposta) %>%
group_by(native_country) %>%
count(resposta) %>%
mutate(prop = prop.table(n)) %>%
filter(resposta == ">50K") %>%
mutate( class = case_when( native_country == "United-States" ~ "United-States",
prop > med ~ ">mean",
prop <= med ~ "<=mean" ) )
tb_country %>%
select(native_country,class) %>%
group_by(class) %>%
count()
# A tibble: 3 x 2
  class             n
  <chr>         <int>
1 <=mean           21
2 >mean            18
3 United-States     1
We end up with 21 countries with proportions below the mean, 18 above it, and "United-States" as the third category.
The resulting distribution has about 5% of the rows in above-mean countries and 5% in below-mean countries.
With the exploratory analysis done, let's start the modeling steps using the tidymodels framework.
Splitting the data into training and test sets, stratified by the response variable.
set.seed(32)
adult_split <- initial_split(adult, prop = 0.8, strata = resposta)
adult_train <- training(adult_split)
adult_test <- testing(adult_split)
The preprocessing steps identified in the EDA — done with the DataExplorer package and the AED_biv function I wrote to explore each variable's relationship with the response — will be stored with recipes, so the same transformations are used both to train the models and to score data later.
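The recipe chunk itself is hidden in this rendering. A hypothetical reconstruction, consistent with the "7 Recipe Steps" printed in the workflows below — the step sequence comes from that output, but which columns each step touches is an assumption based on the EDA decisions:

```r
library(tidymodels)

# Sketch, not the author's exact code.
adult_recipe <- recipe(resposta ~ ., data = adult_train) %>%
  step_mutate(
    capital_total = capital_gain + capital_loss,
    occupation = if_else(is.na(occupation), "Farming-fishing", occupation),
    workclass  = if_else(is.na(workclass), "not-identify", workclass)
  ) %>%
  step_rm(capital_gain, capital_loss, id) %>%            # columns dropped: assumption
  step_string2factor(all_nominal(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  step_zv(all_predictors()) %>%                          # drop zero-variance columns
  step_novel(all_nominal(), -all_outcomes()) %>%         # guard against unseen levels
  step_dummy(all_nominal(), -all_outcomes())
```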
Specifying the cross-validation:
set.seed(32)
adult_vfold <- vfold_cv(adult_train, v = 5, strata = resposta)
adult_vfold
# 5-fold cross-validation using stratification
The models to be fitted:
Note: the hyperparameter values shown were obtained from the tuning runs and are hard-coded here only to speed up rendering of the script.
Specifying the model:
#1.069415e-09 8 19 Model04
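The specification chunk is hidden; a sketch consistent with the printout below, with the hyperparameter values taken from the tuning run (per the note above):

```r
library(tidymodels)

# Decision tree spec on the rpart engine.
adult_tree <- decision_tree(
  cost_complexity = 1.069415e-09,
  tree_depth      = 8,
  min_n           = 19
) %>%
  set_engine("rpart") %>%
  set_mode("classification")
```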
adult_tree
Decision Tree Model Specification (classification)
Main Arguments:
cost_complexity = 1.069415e-09
tree_depth = 8
min_n = 19
Computational engine: rpart
Workflow for the decision tree:
workflow_adult_tree
== Workflow ====================================================================
Preprocessor: Recipe
Model: decision_tree()
-- Preprocessor ----------------------------------------------------------------
7 Recipe Steps
* step_mutate()
* step_rm()
* step_string2factor()
* step_normalize()
* step_zv()
* step_novel()
* step_dummy()
-- Model -----------------------------------------------------------------------
Decision Tree Model Specification (classification)
Main Arguments:
cost_complexity = 1.069415e-09
tree_depth = 8
min_n = 19
Computational engine: rpart
Parameters:
hiperparams <- parameters(
adult_tree
)
hiperparams
Collection of 3 parameters for tuning
id parameter type object class
cost_complexity cost_complexity nparam[+]
tree_depth tree_depth nparam[+]
min_n min_n nparam[+]
Grid:
Performing hyperparameter tuning:
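The tuning chunk is hidden; a sketch of what it likely looked like. Object names (tree_grid, tree_tune) and the grid size are assumptions, and note that tuning requires a spec carrying tune() placeholders rather than the fixed values printed above:

```r
set.seed(32)

# Space-filling grid over the 3 tree hyperparameters collected above.
tree_grid <- grid_max_entropy(hiperparams, size = 20)

tree_tune <- workflow_adult_tree %>%
  tune_grid(
    resamples = adult_vfold,
    grid      = tree_grid,
    control   = control_grid(save_pred = TRUE),
    metrics   = metric_set(roc_auc)
  )

tree_best_hiperparams <- select_best(tree_tune)
```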
Finalizing the workflow:
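Also a hidden chunk; presumably finalize_workflow() plugs the winning hyperparameters back into the workflow (object names assumed):

```r
# Replace tune() placeholders with the best values found above.
workflow_tree_final <- workflow_adult_tree %>%
  finalize_workflow(tree_best_hiperparams)
```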
workflow_tree_final
== Workflow ====================================================================
Preprocessor: Recipe
Model: decision_tree()
-- Preprocessor ----------------------------------------------------------------
7 Recipe Steps
* step_mutate()
* step_rm()
* step_string2factor()
* step_normalize()
* step_zv()
* step_novel()
* step_dummy()
-- Model -----------------------------------------------------------------------
Decision Tree Model Specification (classification)
Main Arguments:
cost_complexity = 1.069415e-09
tree_depth = 8
min_n = 19
Computational engine: rpart
Checking variable importance:
Final model:
Specifying the model:
# 23 1715 21
adult_rf
Random Forest Model Specification (classification)
Main Arguments:
mtry = 23
trees = 1715
min_n = 21
Computational engine: randomForest
Workflow for the random forest:
workflow_adult_rf <-
adult_wf %>%
add_model(adult_rf)
Grid:
parameters(adult_rf)
Collection of 0 parameters for tuning
[1] id parameter type object class
<0 rows> (or 0-length row.names)
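parameters() finds nothing to tune here because the printed spec already carries fixed values. The rf_grid used below must have come from a spec with tune() placeholders; a hedged sketch (names assumed, size 10 matching the 10 candidate models per fold in the log):

```r
# Hypothetical tunable spec and grid for the random forest.
rf_tune_spec <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("randomForest") %>%
  set_mode("classification")

rf_grid <- parameters(rf_tune_spec) %>%
  finalize(select(adult_train, -resposta)) %>%  # mtry's upper bound depends on the data
  grid_max_entropy(size = 10)
```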
Performing hyperparameter tuning:
set.seed(123)
rf_tune<-
workflow_adult_rf %>%
tune_grid(
resamples = adult_vfold,
grid = rf_grid,
control = control_grid(save_pred = TRUE, verbose = T, allow_par = T),
metrics = metric_set(roc_auc)
)
(verbose tune_grid progress log omitted: 5 folds x 10 candidate models, all completed)
rf_best_hiperparams <- select_best(rf_tune)
Error in .get_tune_metric_names(x) : object 'rf_tune' not found
(The tuning chunk above appears not to have been evaluated in this render, so rf_tune is missing from the session.)
Finalizing the workflow:
workflow_rf_final
== Workflow ====================================================================
Preprocessor: Recipe
Model: rand_forest()
-- Preprocessor ----------------------------------------------------------------
7 Recipe Steps
* step_mutate()
* step_rm()
* step_string2factor()
* step_normalize()
* step_zv()
* step_novel()
* step_dummy()
-- Model -----------------------------------------------------------------------
Random Forest Model Specification (classification)
Main Arguments:
mtry = 23
trees = 1715
min_n = 21
Computational engine: randomForest
Checking variable importance:
Final model:
Since XGBoost has many hyperparameters, I chose not to tune loss_reduction and sample_size in this first pass. The xgboost engine defaults are therefore used for them: loss_reduction = 0 and sample_size = 1.
adult_xgb
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = 34
trees = 1309
min_n = 5
tree_depth = 10
learn_rate = 0.0106
Computational engine: xgboost
Workflow for XGBoost:
workflow_adult_xgb
== Workflow ====================================================================
Preprocessor: Recipe
Model: boost_tree()
-- Preprocessor ----------------------------------------------------------------
7 Recipe Steps
* step_mutate()
* step_rm()
* step_string2factor()
* step_normalize()
* step_zv()
* step_novel()
* step_dummy()
-- Model -----------------------------------------------------------------------
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = 34
trees = 1309
min_n = 5
tree_depth = 10
learn_rate = 0.0106
Computational engine: xgboost
Grid:
xgb_grid <- parameters(adult_xgb) %>%
finalize(bake(prep(adult_recipe),adult_train)) %>%
grid_max_entropy(size = 20)
Error: At least one parameter object is required.
(The call fails because the printed spec has all values fixed, so parameters() finds nothing to finalize; the original run used a spec with tune() placeholders.)
Performing hyperparameter tuning:
library(doFuture)
all_cores <- parallel::detectCores(logical = FALSE) - 1
registerDoFuture()
cl <- makeCluster(all_cores)
plan(future::cluster, workers = cl)
getDoParWorkers()
[1] 3
# grid search
ini <- Sys.time()
xgb_tune <-
workflow_adult_xgb %>%
tune_grid(
resamples = adult_vfold,
grid = xgb_grid,
control = control_grid(verbose = TRUE),
metrics = metric_set(roc_auc)
)
(verbose tune_grid progress log omitted: 5 folds x 20 candidate models, all completed)
Sys.time() - ini  # Time difference of 39.9844 mins (parallel)
Time difference of 48.61914 mins
foreach::registerDoSEQ()
xgb_best_hiperparams
Error: object 'xgb_best_hiperparams' not found
Finalizing the workflow:
workflow_xgb_final
== Workflow ====================================================================
Preprocessor: Recipe
Model: boost_tree()
-- Preprocessor ----------------------------------------------------------------
7 Recipe Steps
* step_mutate()
* step_rm()
* step_string2factor()
* step_normalize()
* step_zv()
* step_novel()
* step_dummy()
-- Model -----------------------------------------------------------------------
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = 34
trees = 1309
min_n = 5
tree_depth = 10
learn_rate = 0.0106
Computational engine: xgboost
Checking variable importance:
workflow_xgb_final %>%
fit(adult_train) %>%
pull_workflow_fit() %>%
vip::vip(geom = "col")
Final model:
Now let's plug in the values found during tuning for the other parameters and tune sample_size and loss_reduction:
adult_xgb2
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = 34
trees = 1309
min_n = 5
tree_depth = 10
learn_rate = 0.0106445
loss_reduction = 0.000127
sample_size = 0.989
Computational engine: xgboost
Workflow for XGBoost:
workflow_adult_xgb2
== Workflow ====================================================================
Preprocessor: Recipe
Model: boost_tree()
-- Preprocessor ----------------------------------------------------------------
7 Recipe Steps
* step_mutate()
* step_rm()
* step_string2factor()
* step_normalize()
* step_zv()
* step_novel()
* step_dummy()
-- Model -----------------------------------------------------------------------
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = 34
trees = 1309
min_n = 5
tree_depth = 10
learn_rate = 0.0106445
loss_reduction = tune()
sample_size = tune()
Computational engine: xgboost
Grid:
Performing hyperparameter tuning:
getDoParWorkers()
[1] 3
Finalizing the workflow:
workflow_xgb_final2
== Workflow ====================================================================
Preprocessor: Recipe
Model: boost_tree()
-- Preprocessor ----------------------------------------------------------------
7 Recipe Steps
* step_mutate()
* step_rm()
* step_string2factor()
* step_normalize()
* step_zv()
* step_novel()
* step_dummy()
-- Model -----------------------------------------------------------------------
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = 34
trees = 1309
min_n = 5
tree_depth = 10
learn_rate = 0.0106445
loss_reduction = 0.000127
sample_size = 0.989
Computational engine: xgboost
Checking variable importance:
Final model:
The ROC curves show that XGBoost outperformed both the random forest and the decision tree. Interestingly, the XGBoost without tuned loss_reduction and sample_size did slightly better than xgboost2, where we tuned those two hyperparameters. Our final model will therefore be xgb_final.
Let's finalize our champion model and score the validation set for submission:
xgboost_modelo_final
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = 34
trees = 1309
min_n = 5
tree_depth = 10
learn_rate = 0.0106
Computational engine: xgboost
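The scoring chunk is hidden; a sketch of how adult_val likely gained the more_than_50k column used below. The probability column name depends on the response's factor levels, so `.pred_>50K` here is an assumption:

```r
# Fit the finalized workflow on the full training set, then score validation.
xgb_fit_final <- fit(workflow_xgb_final, adult_train)

adult_val <- adult_val %>%
  bind_cols(predict(xgb_fit_final, new_data = adult_val, type = "prob")) %>%
  rename(more_than_50k = `.pred_>50K`)
```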
Confusion matrix:
adult_val %>%
transmute(resposta = factor(resposta, levels = c(">50K", "<=50K")),
more_than_50k = ifelse(more_than_50k > 0.5, ">50K", "<=50K") %>%
factor(levels = c(">50K", "<=50K"))) %>%
table() %>%
caret::confusionMatrix()
Confusion Matrix and Statistics
more_than_50k
resposta >50K <=50K
>50K 2505 1341
<=50K 719 11716
Accuracy : 0.8735
95% CI : (0.8683, 0.8785)
No Information Rate : 0.802
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6286
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.7770
Specificity : 0.8973
Pos Pred Value : 0.6513
Neg Pred Value : 0.9422
Prevalence : 0.1980
Detection Rate : 0.1539
Detection Prevalence : 0.2362
Balanced Accuracy : 0.8371
'Positive' Class : >50K
Selecting the columns in the submission format:
submissao <- adult_val %>% select(id, more_than_50k)
write_csv(submissao, "submissao.csv")